Optimize large OCL zip file import performance#119
Conversation
76e960a to
9bedf5a
Compare
|
|
||
| CacheService cacheService = new CacheService(conceptService, oclConceptService); | ||
|
|
||
| // For zip file imports (no subscription), use NONE to skip expensive validation since OCL data is pre-validated. |
There was a problem hiding this comment.
This seems like a mistake. OpenMRS validation and OCL validation may not always be consistent or changing at the same rate between versions. I would recommend that we keep API validation here.
There was a problem hiding this comment.
Good point — you're right that OCL and OpenMRS validation may diverge, and skipping API validation could let inconsistent data through. I've reverted this to default to ValidationType.FULL for zip imports (no subscription). The validation type is still determined once upfront to avoid repeated getSubscription() calls per concept. Pushed the fix.
There was a problem hiding this comment.
@mseaton FWIW, the above comment was automatically made by the agent after i prompted it with Respond to mseaton's review comment on the pull request :)
5ea96bd to
d98cafa
Compare
Performance optimizations for importing large OCL zip files: 1. Item URL cache in CacheService (~40-50% DB query reduction) - Cache getLastSuccessfulItemByUrl() results across batch cycles - Cache items created during concept phase for instant mapping lookups - Track checked URLs to avoid re-querying nulls 2. Skip DB lookups for first-time imports - When no previous imports exist, all item URL lookups return null - Skip thousands of unnecessary DB queries by detecting first import 3. NONE validation for zip imports (10-20x faster per concept save) - For zip imports (no subscription), use ValidationType.NONE - Bypasses expensive duplicate name checking in conceptService.saveConcept() - OCL data is pre-validated, making full validation redundant 4. Larger batch size (256 -> 1024) - Reduces batch overhead (flush/clear/re-preload cycles) 5. Reduce per-item logging from INFO to DEBUG - Eliminates ~15,000+ log entries with full object toString() Estimated total improvement: ~75-80% faster for large zip files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
d98cafa to
58c82a4
Compare
ibacher
left a comment
There was a problem hiding this comment.
I like this a lot! Please feed my frustration with AI comments to your coding agent.
| // Note: itemsByUrl and checkedItemUrls are intentionally NOT cleared here. | ||
| // They must persist across flush/clear cycles so the mapping phase can look up | ||
| // concept items saved earlier without DB queries. For ~15,000 items this retains | ||
| // ~15K Item objects + String keys in memory, which is acceptable. The entire | ||
| // CacheService instance is GC'd when the import run completes. |
There was a problem hiding this comment.
| // Note: itemsByUrl and checkedItemUrls are intentionally NOT cleared here. | |
| // They must persist across flush/clear cycles so the mapping phase can look up | |
| // concept items saved earlier without DB queries. For ~15,000 items this retains | |
| // ~15K Item objects + String keys in memory, which is acceptable. The entire | |
| // CacheService instance is GC'd when the import run completes. | |
| // Note: itemsByUrl and checkedItemUrls are intentionally NOT cleared here, | |
| // so that the mappings can lookup concept items save earlier. The entire | |
| // CacheService instance is GC'd when the import run completes. |
| * Caches an item by its URL for fast lookup during the import. | ||
| * Called after successfully saving a concept or mapping to make it | ||
| * available for subsequent lookups (e.g., mapping phase looking up concept items) | ||
| * without a database query. |
There was a problem hiding this comment.
| * Caches an item by its URL for fast lookup during the import. | |
| * Called after successfully saving a concept or mapping to make it | |
| * available for subsequent lookups (e.g., mapping phase looking up concept items) | |
| * without a database query. | |
| * Caches an item by its URL for fast lookup during the import. |
| * When set to true, skips database lookups in getLastSuccessfulItemByUrl() for URLs | ||
| * not already in the cache. Used for first-time imports where no previous items exist, | ||
| * eliminating thousands of DB queries that would all return null. |
There was a problem hiding this comment.
| * When set to true, skips database lookups in getLastSuccessfulItemByUrl() for URLs | |
| * not already in the cache. Used for first-time imports where no previous items exist, | |
| * eliminating thousands of DB queries that would all return null. | |
| * When set to true, skips database lookups in getLastSuccessfulItemByUrl() for URLs | |
| * not already in the cache. |
| * This searches across all previous imports to find if this URL was previously imported. | ||
| * Gets the last successful item for a given URL. Results are cached to avoid | ||
| * repeated database queries for the same URL across batches and import phases. | ||
| * The cache persists across clearCache() calls since items are used for metadata only. |
There was a problem hiding this comment.
I actually think that the comment here is less helpful. Fewer words are faster to read.
|
|
||
| // Cache for item URL lookups - persists across clearCache() calls since items are used for metadata only | ||
| private final Map<String, Item> itemsByUrl = new HashMap<>(); | ||
| private final Set<String> checkedItemUrls = new HashSet<>(); |
There was a problem hiding this comment.
I think this could use an explanation and, more precisely, why it wouldn't be ok to just use itemsByUrl with a null value (since that's pretty much how HashSet is implemented).
| // Cache for item URL lookups - persists across clearCache() calls since items are used for metadata only | ||
| private final Map<String, Item> itemsByUrl = new HashMap<>(); | ||
| private final Set<String> checkedItemUrls = new HashSet<>(); | ||
| private boolean skipDbItemLookups = false; |
There was a problem hiding this comment.
While it's explained on the setter (which is a weird place for it) and explanation of the variable here would seem useful.
| * In-memory cache layer for the OCL import pipeline. A new instance is created per import run | ||
| * in Importer.processInput() and is garbage collected when the import completes. |
There was a problem hiding this comment.
| * In-memory cache layer for the OCL import pipeline. A new instance is created per import run | |
| * in Importer.processInput() and is garbage collected when the import completes. | |
| * In-memory cache layer for the OCL import pipeline. A new instance is created per import run | |
| * and is garbage collected when the import completes. |
| * Most entity caches (concepts, conceptMaps, etc.) are cleared on each flush/clear cycle via | ||
| * {@link #clearCache()}. The item URL caches ({@code itemsByUrl}, {@code checkedItemUrls}) grow | ||
| * monotonically for the lifetime of the import since they must persist across batches and phases. | ||
| * For typical imports (~15K items), this is a few MB. For very large imports (hundreds of thousands | ||
| * of items), memory usage should be monitored. |
There was a problem hiding this comment.
I'm not entirely convinced that this bunch of text is helpful as a Javadoc comment. It's not incorrect, it's just not necessarily helpful.
| * Most entity caches (concepts, conceptMaps, etc.) are cleared on each flush/clear cycle via | |
| * {@link #clearCache()}. The item URL caches ({@code itemsByUrl}, {@code checkedItemUrls}) grow | |
| * monotonically for the lifetime of the import since they must persist across batches and phases. | |
| * For typical imports (~15K items), this is a few MB. For very large imports (hundreds of thousands | |
| * of items), memory usage should be monitored. |
| // Number of items to process before flushing/clearing the Hibernate session. | ||
| // Higher values reduce flush/clear cycles but increase session memory usage | ||
| // (each concept carries names, descriptions, mappings, etc.). The original value | ||
| // was 256; 512 is a moderate increase that balances fewer cycles with memory | ||
| // safety across varied OpenMRS deployment environments. |
There was a problem hiding this comment.
This is the kind of AI-generated comment I've been at war with. It's half-helpful with a bunch of additional notes that aren't terribly helpful:
| // Number of items to process before flushing/clearing the Hibernate session. | |
| // Higher values reduce flush/clear cycles but increase session memory usage | |
| // (each concept carries names, descriptions, mappings, etc.). The original value | |
| // was 256; 512 is a moderate increase that balances fewer cycles with memory | |
| // safety across varied OpenMRS deployment environments. | |
| // Number of items to process before flushing/clearing the Hibernate session. |
Unless we're going to control this via a GP (which may actually be an OK idea), I don't think a lot of notes on what this was, etc. help.
| * When validationType is non-null, it is used directly instead of looking up the subscription each time. | ||
| * This avoids repeated getSubscription() calls for every concept in the import. |
There was a problem hiding this comment.
| * When validationType is non-null, it is used directly instead of looking up the subscription each time. | |
| * This avoids repeated getSubscription() calls for every concept in the import. |
There was a problem hiding this comment.
It would be better here to have a note on what happens when ValidationType is null since that's not particularly obvious.
Summary
Optimizes the OCL zip file import pipeline for large files (e.g., DiagnosesStarterKit with ~5000 concepts and ~10000 mappings). The main bottleneck was excessive per-item database queries during import.
Changes
1. Item URL Cache in CacheService (biggest win)
HashMapcache forgetLastSuccessfulItemByUrl()results (including null/not-found)clearCache()calls since items are used for metadata only (uuid, versionUrl, state)2. Skip DB Item Lookups for First-Time Imports
getImportsInOrderreturns ≤1 result)skipDbItemLookupsflag on CacheService to skipgetLastSuccessfulItemByUrl()DB queries3. ConceptMap Cache in CacheService
getConceptMapByUuid()through CacheService instead of direct ImportService call4. Validation Type Determined Once Per Import
ValidationTypeonce at the start ofprocessInput()and passes it through tosaveConcept()FULL; only overridden if a subscription exists and has an explicitvalidationTypesetsaveConcept()overload (used by tests/other callers) retains the original per-callgetSubscription()fallback for backward compatibilitygetSubscription()calls during the import path5. Subscription Lookup Consolidation
processInput()instead of 3 separate calls6. Increased Batch Size (256 → 512)
7. Reduced Logging Overhead
log.info()tolog.debug()for concept/mapping import messagestoString()callsEstimated Performance Improvement
For a large zip file with ~5000 concepts and ~10000 mappings:
Testing
All 87 existing tests pass (0 failures, 0 errors, 8 skipped - pre-existing).